GOGGLES: Automatic Image Labeling with Affinity Coding
Generating large labeled training data is becoming the biggest bottleneck in
building and deploying supervised machine learning models. Recently, the data
programming paradigm has been proposed to reduce the human cost in labeling
training data. However, data programming relies on designing labeling functions
which still requires significant domain expertise. Also, it is prohibitively
difficult to write labeling functions for image datasets as it is hard to
express domain knowledge using raw features for images (pixels).
We propose affinity coding, a new domain-agnostic paradigm for automated
training data labeling. The core premise of affinity coding is that the
affinity scores of instance pairs belonging to the same class on average should
be higher than those of pairs belonging to different classes, according to some
affinity functions. We build the GOGGLES system that implements affinity coding
for labeling image datasets by designing a novel set of reusable affinity
functions for images, and propose a novel hierarchical generative model for
class inference using a small development set.
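The core premise above can be illustrated with a minimal sketch. It is not the GOGGLES implementation: the cosine affinity function, the synthetic two-dimensional features, and the helper names `cosine_affinity` and `mean_affinities` are all assumptions chosen for illustration. It only checks that, on toy data, same-class pairs have a higher mean affinity than different-class pairs.

```python
import numpy as np

def cosine_affinity(features):
    """Pairwise cosine similarity between row vectors (an assumed affinity function)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms
    return unit @ unit.T

def mean_affinities(affinity, labels):
    """Return (mean same-class affinity, mean different-class affinity)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # ignore self-pairs
    return affinity[same & off_diag].mean(), affinity[~same].mean()

rng = np.random.default_rng(0)
# Two synthetic classes clustered around different directions (toy data).
class_a = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(20, 2))
class_b = rng.normal(loc=[0.0, 1.0], scale=0.1, size=(20, 2))
features = np.vstack([class_a, class_b])
labels = [0] * 20 + [1] * 20

same_mean, diff_mean = mean_affinities(cosine_affinity(features), labels)
print(same_mean > diff_mean)
```

In practice the labels are unknown; the abstract describes inferring them from the affinity scores with a generative model and a small development set, which this sketch does not attempt.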
We compare GOGGLES with existing data programming systems on 5 image labeling
tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a
minimum of 71% to a maximum of 98% without requiring any extensive human
annotation. In terms of end-to-end performance, GOGGLES outperforms the
state-of-the-art data programming system Snuba by 21% and a state-of-the-art
few-shot learning technique by 5%, and is only 7% away from the fully
supervised upper bound.
Comment: Published at 2020 ACM SIGMOD International Conference on Management of Data
Rethinking Similarity Search: Embracing Smarter Mechanisms over Smarter Data
In this vision paper, we propose a shift in perspective for improving the
effectiveness of similarity search. Rather than focusing solely on enhancing
the data quality, particularly machine learning-generated embeddings, we
advocate for a more comprehensive approach that also enhances the underpinning
search mechanisms. We highlight three novel avenues that call for a
redefinition of the similarity search problem: exploiting implicit data
structures and distributions, engaging users in an iterative feedback loop, and
moving beyond a single query vector. These novel pathways have gained relevance
in emerging applications such as large-scale language models, video clip
retrieval, and data labeling. We discuss the corresponding research challenges
posed by these new problem areas and share insights from our preliminary
discoveries.
Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions
Machine learning (ML) applications have been thriving recently, largely
attributed to the increasing availability of data. However, inconsistency and
incomplete information are ubiquitous in real-world datasets, and their impact
on ML applications remains elusive. In this paper, we present a formal study of
this impact by extending the notion of Certain Answers for Codd tables, which
has been explored by the database research community for decades, into the
field of machine learning. Specifically, we focus on classification problems
and propose the notion of "Certain Predictions" (CP) -- a test data example can
be certainly predicted (CP'ed) if all possible classifiers trained on top of
all possible worlds induced by the incompleteness of data would yield the same
prediction.
We study two fundamental CP queries: (Q1) checking query that determines
whether a data example can be CP'ed; and (Q2) counting query that computes the
number of classifiers that support a particular prediction (i.e., label). Given
that general solutions to CP queries are, not surprisingly, hard without
assumptions about the type of classifier, we further present a case study in the
context of nearest neighbor (NN) classifiers, where efficient solutions to CP
queries can be developed -- we show that it is possible to answer both queries
in linear or polynomial time over exponentially many possible worlds.
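The two CP queries can be made concrete with a brute-force sketch for a 1-NN classifier. This is not the efficient algorithm from the paper: it enumerates every possible world explicitly, so it runs in exponential time. The one-dimensional features, the candidate-value representation of missing cells, and the function names are assumptions for illustration only.

```python
from collections import Counter
from itertools import product

def nn_predict(train, test_x):
    """1-NN on one-dimensional features; ties are broken by row order."""
    return min(train, key=lambda row: abs(row[0] - test_x))[1]

def cp_counts(rows, test_x):
    """Q2 (counting): number of possible worlds supporting each label.

    rows is a list of (candidates, label) pairs, where candidates lists every
    value the possibly missing feature may take (one entry if it is known).
    """
    counts = Counter()
    for world in product(*(cands for cands, _ in rows)):
        train = [(x, label) for x, (_, label) in zip(world, rows)]
        counts[nn_predict(train, test_x)] += 1
    return counts

def certainly_predicted(rows, test_x):
    """Q1 (checking): True if every possible world yields the same label."""
    return len(cp_counts(rows, test_x)) == 1

# Toy data: the first feature is known, the other two are incomplete.
rows = [([1.0], "a"), ([5.0, 6.0], "b"), ([0.5, 9.0], "a")]
print(certainly_predicted(rows, 1.2), cp_counts(rows, 1.2))
print(certainly_predicted(rows, 3.2), cp_counts(rows, 3.2))
```

Here each possible world induces exactly one 1-NN classifier, so counting worlds and counting classifiers coincide. The test point 1.2 is labelled "a" in all four worlds, while 3.2 is labelled "a" in two worlds and "b" in the other two, so it cannot be certainly predicted.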
We demonstrate one example use case of CP in the important application of
"data cleaning for machine learning (DC for ML)." We show that our proposed
CPClean approach, built on CP, can often significantly outperform existing
techniques in terms of classification accuracy with mild manual cleaning
effort.
Experiences and Lessons Learned from the SIGMOD Entity Resolution Programming Contests
We report our experience in running three editions (2020, 2021, 2022) of the SIGMOD programming contest, a well-known event in which students solve exciting data management problems. During this period we had the opportunity to introduce participants to the entity resolution task, which is of paramount importance in the data integration community. We aim to share the executive decisions made by the co-authors of this report and the lessons learned.
FAST discovery of a fast neutral hydrogen outflow
In this letter, we report the discovery of a fast neutral hydrogen outflow in
SDSS J145239.38+062738.0, a merging radio galaxy containing an optical type I
active galactic nucleus (AGN). This discovery was made through observations
conducted by the Five-hundred-meter Aperture Spherical radio Telescope (FAST)
using redshifted 21-cm absorption. The outflow exhibits a blueshifted velocity
likely up to with respect to the systemic velocity
of the host galaxy with an absorption strength of corresponding to an optical depth of 0.002 at . The mass outflow rate ranges between and , implying an energy outflow rate ranging between
and , assuming 100 K
1000 K. Plausible drivers of the outflow include starbursts,
the AGN radiation, and the radio jet, the last of which is considered the most
likely culprit according to the kinematics. By analysing the properties of the
outflow, the AGN, and the jet, we find that if the HI outflow is driven by the
AGN radiation, the AGN radiation seems not powerful enough to provide negative
feedback whereas the radio jet shows the potential to provide negative
feedback. Our observations contribute another example of a fast outflow
detected in neutral hydrogen, as well as demonstrate the capability of FAST in
detecting such outflows.
Comment: Accepted by ApJ
Does a radio jet drive the massive multi-phase outflow in the ultra-luminous infrared galaxy IRAS 10565+2448?
We present new upgraded Giant Metrewave Radio Telescope (uGMRT) HI 21-cm
observations of the ultra-luminous infrared galaxy IRAS 10565+2448, previously
reported to show blueshifted, broad, and shallow HI absorption indicating an
outflow. Our higher spatial resolution observations have localised this
blueshifted outflow, which is 1.36 kpc southwest of the radio centre and
has a blueshifted velocity of and a full width at
half maximum (FWHM) of . The spatial extent and
kinematic properties of the HI outflow are consistent with the previously
detected cold molecular outflows in IRAS 10565+2448, suggesting that they
likely have the same driving mechanism and are tracing the same outflow. By
combining the multi-phase gas observations, we estimate a total outflowing mass
rate of at least and a total energy loss rate of
at least , where the contribution from the
ionised outflow is negligible, emphasising the importance of including both
cold neutral and molecular gas when quantifying the impact of outflows. We
present evidence of the presence of a radio jet and argue that this may play a
role in driving the observed outflows. The modest radio luminosity
of the jet in IRAS
10565+2448 implies that the jet contribution to driving outflows should not be
ignored in low radio luminosity AGN.
Comment: 12 pages, 9 figures, accepted for publication in MNRAS
Theoretical Investigations into Self-Organized Ordered Metallic Semi-Clusters Arrays on Metallic Substrate
Using energy minimization calculations based on an interfacial potential and on a first-principles total energy method, respectively, we show that the (2 × 2)/(3 × 3) Pb/Cu(111) system is a stable structure among all the [(n − 1) × (n − 1)]/(n × n) Pb/Cu(111) (n = 2, 3,…, 12) structures. The electronic structure calculations indicate that self-organized ordered Pb semi-cluster arrays are formed on the first Pb monolayer of (2 × 2)/(3 × 3) Pb/Cu(111), owing to a strain-release effect induced by the inherent misfits. The Pb semi-cluster structure can induce selective adsorption of atoms of semiconductor materials (e.g., Ge) around the semi-clusters and can therefore be used as a template for the growth of nanoscale structures with a very short periodic length (7.67 Å).
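As a rough consistency check on the quoted periodic length, the sketch below computes the repeat distance of a (3 × 3) supercell on Cu(111). The Cu lattice constant of 3.615 Å is an assumed value that does not appear in the abstract, and the authors may have used a different (for example, computed) constant.

```python
import math

CU_LATTICE_CONSTANT = 3.615  # Å; assumed fcc Cu value, not given in the abstract

# Nearest-neighbour spacing within an fcc (111) plane is a / sqrt(2).
cu_spacing = CU_LATTICE_CONSTANT / math.sqrt(2)

# A (3 × 3) Cu(111) supercell repeats every three substrate spacings.
period = 3 * cu_spacing
print(f"{period:.2f} Å")
```

Under this assumption the result rounds to 7.67 Å, in line with the value quoted in the abstract.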